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Summary. We present in this paper our work regarding simulating a type of P sys- 
tem known as a spiking neural P system (SNP system) using graphics processing units 
(GPUs). GPUs, because of their architectural optimization for parallel computations, 
are well-suited for highly parallelizable problems. Due to the advent of general purpose 
GPU computing in recent years, GPUs are not limited to graphics and video processing 
alone, but include computationally intensive scientific and mathematical applications as 
well. Moreover P systems, including SNP systems, are inherently and maximally parallel 
computing models whose inspirations are taken from the functioning and dynamics of a 
living cell. In particular, SNP systems try to give a modest but formal representation of 
a special type of cell known as the neuron and their interactions with one another. The 
nature of SNP systems allowed their representation as matrices, which is a crucial step 
in simulating them on highly parallel devices such as GPUs. The highly parallel nature 
of SNP systems necessitate the use of hardware intended for parallel computations. The 
simulation algorithms, design considerations, and implementation are presented. Finally, 
simulation results, observations, and analyses using an SNP system that generates all 
numbers in N - {1} are discussed, as well as recommendations for future work. 

Key words: Membrane computing, Parallel computing, GPU computing 
1 Introduction 

1.1 Parallel computing: Via graphics processing units (GPUs) 

The trend for massively parallel computation is moving from the more common 
multi-core CPUs towards GPUs for several significant reasons [HI [14] . One impor- 



2 



F. Cabarle, H. Adorna, M. Martmez-del-Amor 



tant reason for such a trend in recent years include the low consumption in terms 
of power of GPUs compared to setting up machines and infrastructure which will 
utilize multiple CPUs in order to obtain the same level of parallelization and per- 
formance |15j . Another more important reason is that GPUs are architectured 
for massively parallel computations since unlike most general purpose multicore 
CPUs, a large part of the architecture of GPUs are devoted to parallel execu- 
tion of arithmetic operations, and not on control and caching just like in CPUs 
[T51 RH] . Arithmetic operations are at the heart of many basic operations as well 
as scientific computations, and these are performed with larger speedups when 
done in parallel as compared to performing them sequentially. In order to per- 
form these arithmetic operations on the GPU, there is a set of techniques called 
GPGPU (General Purpose computations on the GPU) coined by Mark Harris in 
2002 which allows programmers to do computations on GPUs and not be limited 
to just graphics and video processing alone [T]. 

1.2 Parallel computing: Via Membranes 

Membrane computing or its more specific counterpart, a P system, is a Turing 
complete computing model (for several P system variants) that perform computa- 
tions nondeterministically, exhausting all possible computations at any given time. 
This type of unconventional model of computation was introduced by Gheorghe 
Paun in 1998 and takes inspiration and abstraction, similar to other members 
of Natural computing (e.g. DNA/molecular computing, neural networks, quantum 
computing) , from nature [6l [7] . Specifically, P systems try to mimic the consti- 
tution and dynamics of the living cell: the multitude of elements inside it, and 
their interactions within themselves and their environment, or outside the cell's 
skin (the cell's outermost membrane). Before proceeding, it is important to clarify 
what is meant when it is said that nature computes, particularly life or the cell: 
computation in this case involves reading information from memory from past or 
present stimuli, rewrite and retrieve this data as a stimuli from the environment, 
process the gathered data and act accordingly due to this processing [2] . Thus, we 
try to extend the classical meaning of computation presented by Allan Turing. 

SN P systems differ from other types of P systems precisely because they 
are mono — membranar and the working alphabet contains only one object type. 
These characteristics, among others, are meant to capture the workings of a special 
type of cell known as the neuron. Neurons, such as those in the human brain, 
communicate or 'compute' by sending indistinct signals more commonly known 
as action potential or spikes [3] . Information is then communicated and encoded 
not by the spikes themselves, since the spikes are unrecognizable from one another, 
but by (a) the time elapsed between spikes, as well as (b) the number of spikes 
sent /received from one neuron to another, oftentimes under a certain time interval 

®- 

It has been shown that SN P systems, given their nature, are representable by 
matrices [H [5] . This representation allows design and implementation of an SN P 
system simulator using parallel devices such as GPUs. 
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1.3 Simulating SNP systems in GPUs 

Since the time P systems were presented, many simulators and software applica- 
tions have been produced |10j . In terms of High Performance Computing, many 
P system simulators have been also designed for clusters of computers [TT], for 
reconfigurable hardware as in FPGAs [J2] , and even for GPUs E] • All of these 
efforts have shown that parallel architectures are well-suited in performance to 
simulate P systems. However, these previous works on hardware are designed to 
simulate cell-like P system variants, which are among the first P system variants 
to have been introduced. Thus, the efficient simulation of SNP systems is a new 
challenge that requires novel attempts. 

A matrix representation of SN P systems is quite intuitive and natural due 
to their graph-like configurations and properties (as will be further shown in the 



succeeding sections such as in subsection 2.1| 



On the other hand, linear algebra operations have been efficiently implemented 
on parallel platforms and devices in the past years. For instance, there is a large 
number of algorithms implementing matrix — matrix and vector — matrix oper- 
ations on the GPU. These algorithms offer huge performance since dense linear 
algebra readily maps to the data-parallel architecture of GPUs [Ttjl [TT] . 

It would thus seem then that a matrix represented SN P system simulator 
implementation on highly parallel computing devices such as GPUs be a natural 
confluence of the earlier points made. The matrix representation of SN P systems 
bridges the gap between the theoretical yet still computationally powerful SN 
P systems and the applicative and more tangible GPUs, via an SN P system 
simulator. 

The design and implementation of the simulator, including the algorithms 
deviced, architectural considerations, are then implemented using CUDA. The 
Compute Unified Device Architecture (CUDA) programming model, launched by 
NVIDIA in mid-2007, is a hardware and software architecture for issuing and man- 
aging computations on their most recent GPU families (G80 family onward), mak- 
ing the GPU operate as a highly parallel computing device [TS]. CUDA program- 
ming model extends the widely known ANSI C programming language (among 
other languages which can interface with CUDA), allowing programmers to easily 
design the code to be executed on the GPU, avoiding the use of low-level graph- 
ical primitives. CUDA also provides other benefits for the programmer such as 
abstracted and automated scaling of the parallel executed code. 

This paper starts out by introducing and defining the type of SNP system 
that will be simulated. Afterwards the NVIDIA CUDA model and architecture 
are discussed, baring the scalability and parallelization CUDA offers. Next, the 
design of the simulator, constraints and considerations, as well as the details of 
the algorithms used to realize the SNP system are discussed. The simulation results 
are presented next, as well as observations and analysis of these results. The paper 
ends by providing the conclusions and future work. 

The objective of this work is to continue the creation of P system simulators , 
in this particular case an SN P system, using highly parallel devices such as GPUs. 
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Fidelity to the computing model (the type of SNP system in this paper) is a part 
of this objective. 

2 Spiking neural p systems 
2.1 Computing with SN P systems 

The type of SNP systems focused on by this paper (scope) are those without delays 
i.e. those that spike or transmit signals the moment they are able to do so [H|S]. 
Variants which allow for delays before a neuron produces a spike, are also available 
[3]. An SNP system without delay is of the form: 

Definition 1. 

II = (0,o-i, ... , a m , syn,in, out), 

where: 

1.0 = {a} is the alphabet made up of only one object, the system spike a. 

2. o\ , . . . , a m are m number of neurons of the form 

g 1 = (ni,Ri), 1 < i < m, 

where: 

a) n.i > gives the initial number of as i.e. spikes contained in neuron o~i 

b) Ri is a finite set of rules of with two forms: 

(b-1) E/a c — > aP , are known as Spiking rules, where E is a regular expression 
over a, and c > 1, such that p > 1 number of spikes are produced, 
one for each adjacent neuron with ai as the originating neuron and 
a c € L{E). 

(b-2) a s —¥ A, are known as Forgetting rules, for s > 1, such that for each 

rule E/a c — >• a of type (b-1) from R i; a s L{E). 
(b-3) a k — > a, a special case of (b-1) where E = a c , k > c. 

3. syn = {(i,j) \ 1 < i,j < m, i ^ j } are the synapses i.e. connection between 
neurons. 

4- in, out £ {1, 2, . . . , m} are the input and output neurons, respectively. 

Furthermore, rules of type (b-1) are applied if contains k spikes, a k £ L(E) 
and k > c. Using this type of rule uses up or consumes k spikes from the neuron, 
producing a spike to each of the neurons connected to it via a forward pointing 
arrow i.e. away from the neuron. In this manner, for rules of type (b-2) if oi 
contains s spikes, then s spikes are forgotten or removed once the rule is used. 

The non-determinism of SN P systems comes with the fact that more than 
one rule of the several types are applicable at a given time, given enough spikes. 
The rule to be used is chosen non-deterministically in the neuron. However, only 
one rule can be applied or used at a given time |4j [5] . The neurons in an SN 
P system operate in parallel and in unison, under a global clock [3] . For Figure [I] 
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no input neuron is present, but neuron 3 is the output neuron, hence the arrow 
pointing towards the environment, outside the SNP system. 

The SN P system in Figure[T]is 77, a 3 neuron system whose neurons are labeled 
(neuron l/a± to neuron 3/173) and whose rules have a total system ordering from 
(1) to (5). Neuron 1/ai can be seen to have an initial number of spikes equal to 
2 (hence the a 2 seen inside it). There is no input neuron, but 03 is the output 
neuron, as seen by the arrow pointing towards the environment (not to another 
neuron). More formally, 77 can be represented as follows: 

H = ({ a }i <7 ii CT 2) 03) syn, out) where a% = (2,7?i), n\ = 2, R\ = {a 2 /a — > 
a}, (neurons 2 to 3 and their n^s and RiS can be similarly shown), syn = 
{(1,2), (1,3), (2, 1), (2,3)} are the synapses for 77, out = 03. This SN P system 
generates all numbers in the set N - {1}, hence it doesn't halt, which can be eas- 
ily verified by applying the rules in 77, and checking the spikes produced by the 
output neuron 173. This generated set is the result of the computation in 77. 



a la-* a (1 



ma (4) 
a-^\ (5) 



a->a (3) 



Fig. 1. An SNP P system 77, generating all numbers in N - {!}, from [5]. 



2.2 Matrix representation of SNP systems 

A matrix representation of an SN P system makes use of the following vectors and 
matrix definitions [HE] . It is important to note that, just as in Figure [l] a total 
ordering of rules is considered. 

Configuration vector Ck is the vector containing all spikes in every neuron on 
the kth computation step/time, where Co is the initial vector containing all spikes 
in the system at the beginning of the computation. For 77 (in Figure [T]) the initial 
configuration vector is Co =< 2, 1, 1 >. 

Spiking vector shows at a given configuration Ck, if a rule is applicable (has 
value 7) or not (has value instead). For 77 we have the spiking vector Sk =< 
1,0,1,1,0 > given Cq. Note that a 2nd spiking vector, Sk =< 1,0,1,1,0 >, is 
possible if we use rule (2) over rule (1) instead (but not both at the same time, 
hence we cannot have a vector equal to < 1,1,1,1,0 >, so this Sk is invalid ). 
Validity in this case means that only one among several applicable rules is used 
and thus represented in the spiking vector. We can have all the possible vectors 
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composed of Os and Is with length equal to the number of rules, but have only 
some of them be valid, given by \P later at subsection |4. 2 1 

Spiking transition matrix Mjj is a matrix comprised of a^- elements where a^- 
is given as 



Definition 2. 



-c, rule Ti is in <jj and is applied consuming c spikes; 
p, rule ri is in a s (s ^ j and (s, j) S syn) 

and is applied producing p spikes in total; 
0, rule ri is in a s (s ^ j and (s,j) ^ syn). 



For 77, the Mn is as follows: 
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In such a scheme, rows represent rules and columns represent neurons. 
Finally, the following equation provides the configuration vector at the (k+l)th 
step, given the configuration vector and spiking vector at the kth step, and Mn'- 



Ck+i — Ck + Sk ■ M, 



IT- 



(2) 



3 The NVIDIA CUDA architecture 

NVIDIA, a well known manufacturer of GPUs, released in 2007 the CUDA pro- 
gramming model and architecture [15] . Using extensions of the widely known C 
language, a programmer can write parallel code which will then execute in multi- 
ple threads within multiple thread blocks, each contained within a grid of (thread) 
blocks. These grids belong to a single device i.e. a single GPU. Each device/ GPU 
has multiple cores, each capable of running its own block of threads The program 
run in the CUDA model scales up or down, depending on the number of cores 
the programmer currently has in a device. This scaling is done in a manner that 
is abstracted from the user, and is efficiently handled by the architecture as well. 
Automatic and efficient scaling is shown in Figure [2] Parallelized code will run 
faster with more cores than with fewer ones |14j . 

Figure [3] shows another important feature of the CUDA model: the host and 
the device parts. The host controls the execution flow while the device is a highly- 
parallel co-processor. Device pertains to the GPU/s of the system, while the host 
pertains to the CPU/s. A function known as a kernel function, is a function called 
from the host but executed in the device. 

A general model for creating a CUDA enabled program is shown in Listing [T] 
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Multithreaded CUDA Program 
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Fig. 2. NVIDIA CUDA automatic scaling, hence more cores result to faster execution, 
from [14] . 
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Fig. 3. NVIDIA CUDA programming model showing the sequential execution of the 
host code alongside the parallel execution of the kernel function on the device side, from 
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Listing 1. General code flow for CUDA programming written in the CUDA extended C 
language 

/ / allocate memory on GPU e.g. 

cudaMalloc ( ( void**)&dev_a , N * sizeof(int) 

//populate arrays 

//copy arrays from host to device e.g. 
cudaMemcpy ( dev N * sizeof(int), 

cudaMemcpyHostToDevice ) 

//call kernel (GPU) function e.g. 

add«<N, 1»>( dev_a , dev_b , dev_c ); 

// copy arrays from device to host e.g. 
cudaMemcpy ( c, dev_c , N * sizeof( int ) , 
cudaMemcpyDeviceToHost ) 

//display results 

/ / free memory e.g. 
cudaFree( dev_a ); 

Lines 2 and 21, implement CUDA versions of the standard C language functions 
e.g. the standard C function malloc has the CUDA C function counterpart being 
cudaM alloc, and the standard C function free has cudaFree as its CUDA C 
counterpart. 

Lines 8 and 15 show a CUDA C specific function, namely cudaMemcpy, which, 
given an input of pointers ( from Listing [I] host code pointers are single letter vari- 
ables such as a and c,while device code variable counterparts are prefixed by dev _ 
such as dev-a and dev-C ) and the size to copy ( as computed by the sizeof func- 
tion ), moves data from host to device ( parameter cudaMemcpyHostToDevice ) 
or device to host ( parameter cudaMemcpyDeviceToHost). 

A kernel function call uses the triple < and > operator, in this case the kernel 
function 

add <« N,l »>(dev_a 1 devJj, dev_c). 

This function adds the values, per element (and each element is associated to 
1 thread), of the variables dev -a and dev J) sent to the device, collected in variable 
dev.c before being sent back to the host/CPU. The variable N in this case allows 
the programmer to specify N number of threads which will execute the add kernel 
function in parallel, with 1 specifying only one block of thread for all N threads. 
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3.1 Design considerations for the hardware and software setup 

Since the kernel function is executed in parallel in the device, the function needs 
to have its inputs initially moved from the CPU/host to the device, and then back 
from the device to the host after computation for the results. This movement 
of data back and forth should be minimized in order to obtain more efficient, in 
terms of time, execution. Implementing an equation such as |2]), which involves 
multiplication and addition between vectors and a matrix, can be done in parallel 
with the previous considerations in mind. In this case, Cfc, Sk, and Mn are loaded, 
manipulated, and pre-processed within the host code, before being sent to the 
kernel function which will perform computations on these function arguments in 
parallel. To represent C^, Sk, and Mjj, text files are created to house each input, 
whereby each element of the vector or matrix is entered in the file in order, from 
left to right, with a blank space in between as a delimiter. The matrix however is 
entered in row-major ( a linear array of all the elements, rows first, then columns) 
order format i.e. for the matrix Mn seen in (JlJ, the row-major order version is 
simply 

-1,1,1,-2,1,1,1,-1,1,0,0,-1,0,0,-2 (3) 

Row major ordering is a well-known ordering and representation of matrices for 
their linear as well as parallel manipulation in corresponding algorithms [13] . Once 
all computations are done for the (fc + l)th configuration, the result of equation 
([2]) are then collected and moved from the device back to the host, where they can 
once again be operated on by the host/CPU. It is also important to note that these 
operations in the host/CPU provide logic and control of the data/inputs, while 
the device/GPU provides the arithmetic or computational 'muscle', the laborious 
task of working on multiple data at a given time in parallel, hence the current 
dichotomy of the CUDA programming model [S]. The GPU acts as a co-processor 
of the central processor. This division of labor is observed in Listing [T] . 

3.2 Matrix computations and CPU-GPU interactions 

Once all 3 initial and necessary inputs are loaded, as is to be expected from equa- 
tion [2| the device is first instructed to perform multiplication between the spiking 
vector Sk and the matrix Mn- To further simplify computations at this point, 
the vectors are treated and automatically formatted by the host code to appear 
as single row matrices, since vectors can be considered as such. Multiplication is 
done per element (one element is in one thread of the device/GPU), and then the 
products are collected and summed to produce a single element of the resulting 
vector/single row matrix. 

Once multiplication of the Sk and M n is done, the result is added to the 
Cfe, once again element per element, with each element belonging to one thread, 
executed at the same time as the others. 

For this simulator, the host code consists largely of the programming language 
Python, a well-known high- level, object oriented programming (OOP) language. 
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The reason for using a high-level language such as Python is because the initial 
inputs, as well as succeeding ones resulting from exhaustively applying the rules 
and equation ^ require manipulation of the vector/matrix elements or values as 
strings. The strings are then concatenated, checked on (if they conform to the 
form (b-3) for example) by the host, as well as manipulated in ways which will 
be elaborated in the following sections along with the discussion of the algorithm 
for producing all possible and valid SkS and C^s given initial conditions. The host 
code/Python part thus implements the logic and control as mentioned earlier, 
while in it, the device/GPU code which is written in C executes the parallel parts 
of the simulator for CUDA to be utilized. 



4 Simulator design and implementation 

The current SNP simulator, which is based on the type of SNP systems without 
time delays, is capable of implementing rules of the form (b-3) i.e. whenever the 
regular expression E is equivalent to the regular expression a k in that rule. Rules 
are entered in the same manner as the earlier mentioned vectors and matrix, as 
blank space delimited values (from one rule to the other, belonging to the same 
neuron) and $ delimited ( from one neuron to the other) . Thus for the SNP system 
77 shown earlier, the file r containing the blank space and $ delimited values is as 
follows: 

2 2 $ 1 $ 1 2 (4) 

That is, rule (1) from Figure [I] has the value 2 in the file r (though rule (1) 
isn't of the form (b-3) it nevertheless consumes a spike since its regular expression 
is of the same regular expression type as the rest of the rules of 77 ). Another 
implementation consideration was the use of lists in Python, since unlike dic- 
tionaries or tuples, lists in Python are mutable, which is a direct requirement of 
the vector/matrix element manipulation to be performed later on (concatenation 
mostly). Hence a Ck =< 2, 1, 1 > is represented as [2, 1, 1] in Python. That is, at 
the kth configuration of the system, the number of spikes of neuron 1 are given 
by accessing the index (starting at zero) of the configuration vector Python list 
variable confVec, in this case if 

confVec = [2, 1, 1] (5) 

then confVec[0] = 2 gives the number of spikes available at that time for 
neuron 1, confVec[l] = 1 for neuron 2, and so on. The file r, which contains the 
ordered list of neurons and the rules that comprise each of them, is represented as 
a list of sub- lists in the Python/host code. For SNP system 77 and from Q we 
have the following: 

r=[[2,2],[l],[l,2]] (6) 

Neuron l's rules are given by accessing the sub-lists of r (again, starting at 
index zero) i.e. rule (1) is given by r[0][0] = 2 and rule (4) is given by r[2][l] = 1. 
Finally, we have the input file M, which holds the Python list version of {§. 
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4.1 Simulation algorithm implementation 

The general algorithm is shown in Algorithm [T] Each line in Algorithm [T] mentions 
which part/s the simulator code runs in, either in the device (DEVICE) or in 
the host (HOST) part. Step IV of Algorithm [I] makes the algorithm stop with 2 
stopping criteria to do this: 

One is when there are no more available spikes in the system (hence a zero 
value for a configuration vector), and the second one being the fact that all pre- 
viously generated configuration vectors have been produced in an earlier time or 
computation, hence using them again in part I of Algorithm [l] would be pointless, 
since a redundant, infinite loop will only be formed. 



Algorithm 1 Overview of the algorithm for the SNP system simulator 
Require: Input files: confVec, M, r. 

I. (HOST) Load input files. Note that M and r need only be loaded once since they 
are unchanging, Co is loaded once, and then CkS are loaded afterwards. 

II. (HOST) Determine if a rule/element in r is applicable based on its corresponding 
spike value in confVec, and then generate all valid and possible spiking vectors in a 
list of lists spikVec given the 3 initial inputs. 

III. (DEVICE) From part II., run the kernel function on spikVec, which contains all 
the valid and possible spiking vectors for the current confVec and r. This will generate 
the succeeding CfcS and their corresponding SkS. 

IV. (HOST+DEVICE) Repeat steps I to IV (except instead of loading Co as 
confVec, use the generated C^s in III) until a zero configuration vector (vector with 
only zeros as elements) or further CfcS produced are repetitions of a Ck produced at 



an earlier time. (Stopping criteria in subsection 4.1 



Another important point to notice is that either of the stopping criterion from 
|4.1| could allow for a deeply nested computation tree, one that can continue exe- 
cuting for a significantly lengthy amount of time even with a multi-core CPU and 
even the more parallelized GPU. 

4.2 Closer inspection of the SN P system simulator 

The more detailed algorithm for part 77 of Algorithm [T] is as follows. 

Recall from the definition of an SNP system (Definitin [lj that we have m 
number of as. We related m to our implementation by noticing the cardinality of 
the Python list r. 

|r| = m (7) 
<P =\<t Vi \\o- V2 \...\<j v J (8) 



where 
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means the total number of rules in the mth neuron which satisfy the regular 
expresion E in (b-3). m gives the total number of neurons, while W gives the 
expected number of valid and possible SkS which should be produced in a given 
configuration. We also define u as both the largest and last integer value in the sub- 
list (neuron) created in step II of Algorithm [l] and further detailed in Algorithm 
[2j which tells us how many elements of that neuron satisfy E. 

During the exposition of the algorithm, the previous Python lists (from their 
vector/matrix counterparts in earlier sections) ^ and ([6| will be utilized. For part 
II Algorithm [T] we have a sub-algorithm (Algorithm [2| for generating all valid and 
possible spiking vectors given input files M, confVec, and r. 



Algorithm 2 Algorithm further detailing part II in Algorithm [T] 
II- 1. Create a list trap, a copy of r, marking each element of tmp in increasing order of 
N, as long as the element/s satisfy the rule's regular expression E of a rule (given by 
list r ). Elements that don't satisfy E are marked with 0. 
II-2. To generate all possible and valid spiking vectors from tmp, we go through each neu- 
ron i.e. all elements of tmp, since we know a priori m as well as the number of elements 
per neuron which satisfy E. We only need to iterate through each neuron/element of 
tmp, lo times, (from II- 1 ) . We then produce a new list, tmp2, which is made up of 
a sub-list of strings from all possible and valid {1,0} strings i.e. spiking vectors per 
neuron. 

II-3. To obtain all possible and valid {1,0} strings (S&s), given that there are multiple 
strings to be concatenated ( as in tmp2 , s case ), pairing up the neurons first, in order, 
and then exhaustively distributing every element of the first neuron to the elements of 
the 2nd one in the pair. These paired-distributed strings will be stored in a new list, 
trapi. 



Algorithm [2] ends once all {1,0} have been paired up to one another. As an 
illustration of Algorithm [2j consider and ([I]) as inputs to our SNP system 

simulator. The following details the production of all valid and possible spiking 
vectors using Algorithm [2j 

Initially from II- 1 of Algorithm |2j we have 

r =tmp= [[2, 2], [1], [1,2]]. 

Proceeding to illustrate II-2 we have the following passes. 

1st pass: tmp = [[1,2], [1], [1,2]] 
Remark/s: previously, tmp [0][0] was equal to 2, but now has been changed to 1, 
since it satisfies E ( configVec[0] = 2 w/c is equal to 2, the number of spikes 
consumed by that rule).^ 

2nd pass: tmp = [[1,2], [1], [1,2]] 
Remark/s: previously fmp[0][l] = 2, which has now been changed (incidentally) 
to 2 as well, since it's the 2nd element of a\ which satisfies E. 

3rd pass: tmp = [[1, 2], [1], [1, 2]] 
Remark/s: 1st (and only) element of neuron 2 which satisfies E. 
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4th pass: tmp = [[1, 2], [1], [1, 2]] 
Remark/ 's: Same as the 1st pass 

5th pass: tmp = [[1,2], [1], [1,0]] 
Remark/s: element imp[2][l], or the 2nd element/rule of neuron 3 doesn't satisfy 
E. 

Final result: tmp = [[1,2], [1], [1,0]] 

At this point we have the following, based on the earlier definitions: 
m = 3 ( 3 neurons in total, one per element/value of confVec) 
<P = \<t Vi \\<tv 2 \Wv 3 \ = 2*1*1 = 2 

\P tells us the number of valid strings of is and 0s i.e. S^s, which need to be 
produced later, for a given Ck which in this case is confvec. There are only 2 valid 
Sfcs/spiking vectors from ([5| and the rules given in ([6| encoded in the Python list 
r. These S^s are 

< 0,1,1,1,0 > (9) 

< 1,0,1,1,0 > (10) 

In order to produce all SkS in an algorithmic way as is done in Algorithm [2] , it's 
important to notice that first, all possible and valid S&S ( made up of Is and 0s) 
per a have to be produced first, which is facilitated by II-l of Algorithm [2] and its 
output (the current value of the list tmp ). 

Continuing the illustration of II- 1, and illustrating II-2 this time, we iterate 
over neuron 1 twice, since its lo = 2, i.e. neuron 1 has only 2 elements which 
satisfy E, and consequently, it is its 2nd element, 

tmp[0][l] = 2. 

For neuron 1, our first pass along its elements/list is as follows. Its 1st element, 
imp[0][0] = 1 

is the first element to satisfy E, hence it requires a 1 in its place, and in the 
others. We therefore produce the string ' W for it. Next, the 2nd element satisfies 
E and it too, deserves a 1, while the rest get 0s. We produce the string '01 ' for it. 

The new list, tmp2, collecting the strings produced for neuron 1 therefore 
becomes 

tmp2 = [[10,01]] 

Following these procedures, for neuron 2 we get tmp2 to be as follows: 
*mp2=[[10,01],[l]] 

Since neuron 2 which has only one element only has 1 possible and valid string, 
the string 1. Finally, for neuron 3, we get tmp2 to be 
imp2 = [[10,01], [1], [10]] 

In neuron 3, we iterated over it only once because w, the number of elements 
it has which satisfy E, is equal to 1 only. Observe that the sublist 
tmp2[0] = [10,01] 

is equal to all possible and valid {1,0} strings for neuron 1, given rules in ^ 
and the number of spikes in configVec. 

Illustrating II-3 of Algorithm [2] given the valid and possible {1,0} strings 
(spiking vectors) for neurons 1, 2, and 3 (separated per neuron-column) from ([5]) 
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and ^ and from the illustration of II-2, all possible and valid list of {1,0} string/s 
for neuron 1: ['10', '01'], neuron 2: ['1'], and neuron 3: ['10']. 

First, pair the strings of neurons 1 and 2, and then distribute them exhaustively 
to the other neuron's possible and valid strings, concatenating them in the process 
(since they are considered as strings in Python). 

'10' + '1' '101' 

'01' 

and 

'10' 

'01' + '1' -» '011' 

now we have to create a new list from tmp2, which will house the concatenations 
we'll be doing. In this case, 
tmp3 = [101,011] 

next, we pair up tmp3 and the possible and valid strings of neuron 3 

'101' + '10' ->■ '10110' 

'011' 

and 

'101' 

'011' + '10' -> 'OHIO' 
eventually turning tmp3 into 
tmpZ = [10110,01110] 

The final output of the sub-algorithm for the generation of all valid and possible 
spiking vectors is a list, 
tmp3 = [10110,01110] 

As mentioned earlier, & = 2 is the number of valid and possible SkS to be 
expected from r, M77, and Co = [2,1,1] in 77. Thus tmp3 is the list of all possible 
and valid spiking vectors given ^ and ([6| in this illustration. Furthermore, tmp2> 
includes all possible and valid spiking vectors for a given neuron in a given con- 
figuration of an SN P system with all its rules and synapses (interconnections). 
Part IT3 is done ( m — 1) times, albeit exhaustively still so, between the two 
lists/neurons in the pair. 

5 Simulation results, observations, and analyses 

The SNP system simulator (combination of Python and CUDA C) implements the 
algorithms in section [4] earlier. A sample simulation run with the SNP system 77 
is shown below (most of the output has been truncated due to space constraints ) 
with C = [2,1,1] 

****SN p system simulation run STARTS here**** 
Spiking transition Matrix: 

Rules of the form a~n/a~m -> a or a~n ->a loaded: 
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Initial configuration vector: 211 

Number of neurons for the SN P system is 3 

Neuron 1 rules criterion/criteria and total order 

tmpList = [['10', '01'], ['1'], ['10']] 

All valid spiking vectors: allValidSpikVec = 

[['10110', 'OHIO']] 
All generated Cks are allGenCk = 
['2-1-1', '2-1-2', '1-1-2'] 

End of CO 
** 
** 
** 

initial total Ck list is 

['2-1-1', '2-1-2', '1-1-2'] 
Current confVec: 212 
All generated Cks are allGenCk = 
['2-1-1', '2-1-2', '1-1-2', '2-1-3', '1-1-3'] 
** 
** 
** 

Current confVec: 112 

All generated Cks are allGenCk = 

['2-1-1', '2-1-2', '1-1-2', '2-1-3', '1-1-3', 

'2-0-2', '2-0-1'] 

** 

** 



Current confVec: 109 

All generated Cks are allGenCk = ['2-1-1', '2-1-2', 

'1-0-7', '0-1-9', '1-0-8', '1-0-9'] 

** 
** 
** 

No more Cks to use (infinite loop/s otherwise). Stop. 
****SN P system simulation run ENDS here**** 
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That is, the computation tree for SNP system 77 with Co = [2,1,1] went down 
as deep as confVec — 109. At that point, all configuration vectors for all possible 
and valid spiking vectors have been produced. The Python list variable allGenCk 
collects all the C^s produced. In Algorithm [2] all the values of tmp3 are added to 
allGenCk. The final value of allGenCk for the above simulation run is 

allGenCk = ['2-1-1', '2-1-2', '1-1-2', '2-1-3', '1-1-3', '2-0-2', '2-0-1', '2-1-4', '1-1- 
4', '2-0-3', '1-1-1', '0-1-2', '0-1-1', '2-1-5', '1-1-5', '2-0-4', '0-1-3', '1-0-2', '1-0-1', 
'2-1-6', '1-1-6', '2-0-5', '0-1-4', '1-0-3', '1-0-0', '2-1-7', '1-1-7', '2-0-6', '0-1-5', 
'1-0-4', '2-1-8', '1-1-8', '2-0-7', '0-1-6', '1-0-5', '2-1-9', '1-1-9', '2-0-8', '0-1-7', 
'1-0-6', '2-1-10', '1-1-10', '2-0-9', '0-1-8', '1-0-7', '0-1-9', '1-0-8', '1-0-9'] 

It's also noteworthy that the simulation for 77 didn't stop at the 1st stopping 
criteria (arriving at a zero vector i.e. Ck = [0,0,0] ) since 77 generates all natural 
counting numbers greater than 1, hence a loop (an infinite one) is to be expected. 
The simulation run shown above stopped with the 2nd stopping criteria from Sec- 
tion |4| Thus the simulation was able to exhaust all possible configuration vectors 
and their spiking vectors, stopping only since a repetition of an earlier generated 
con fVec/Ck would introduce a loop (triggering the 2nd stopping criteria in sub- 
section |4.1[ ) . Graphically (though not shown exhaustively) the computation tree 
for 77 is shown in Figure [4j 

The confVecs followed by (...) are the confVecs that went deeper i.e. produced 
more CfcS than Figure [4] has shown. 

6 Conclusions and future work 

Using a highly parallel computing device such as a GPU, and the NVIDIA CUDA 
programming model, an SNP system simulator was successfully designed and im- 
plemented as per the objective of this work. The simulator was shown to model 
the workings of an SN P system without delay using the system's matrix rep- 
resentation. The use of a high level programming language such as Python for 
host tasks, mainly for logic and string representation and manipulation of values 
(vector/matrix elements) has provided the necessary expressivity to implement 
the algorithms created to produce and exhaust all possible and valid configuration 
and spiking vectors. For the device tasks, CUDA allowed the manipulation of the 
NVIDIA CUDA enabled GPU which took care of repetitive and highly parallel 
computations (vector-matrix addition and multiplication essentially). 

Future versions of the SNP system simulator will focus on several improve- 
ments. These improvements include the use of an optimized algorithm for matrix 
computations on the GPU without requiring the input matrix to be transformed 
into a square matrix (this is currently handled by the simulator by padding ze- 
ros to an otherwise non-square matrix input). Another improvement would be 
the simulation of systems not of the form (b-3). Byte-compiling the Python/host 
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Fig. 4. The computation tree graphically representing the output of the simulator run 
over n with Co = [2, 1, 1] 



part of the code to improve performance as well as metrics to further enhance 
and measure execution time are desirable as well. Finally, deeper understanding 
of the CUDA architecture, such as inter- thread/block communication, for very 
large systems with equally large matrices, is required. These improvements as well 
as the current version of the simulator should also be run in a machine or setup 
with higher versions of GPUs supporting NVIDIA CUDA. 
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